All artwork by Allison Horst (@allisonhorst) unless otherwise noted.
Why should you learn R?
It’s…
totally free!
open-source
a scripting language (reproducible and transparent)
an established tool with tons of tutorials and help pages
an amazing and inclusive community
R = programming language that can be written directly into your computer’s console/terminal
R-Studio = integrated development environment (IDE) with lots of features
Almost all R-users have both installed
Four panels:
Top-left: scripting panel – write and save code (may only open when you open a new file)
Bottom-left: console panel – input/output that won’t be saved
Top-right: environment panel – dataframes, variables (and sometimes has other features too, depending on what you’re working with)
Bottom-right: install and update packages, preview plots, read help files, and some other features
File -> New File -> R Script
simple: only code
write lines of code in the order they are meant to be run
run each line by clicking that line and pressing Control + R (on Windows) or Cmd + Enter (on Mac)
output shows below, in console panel (number in bracket shows amount of output if there is more than one line)
Try it out: 5 + 5
comments that are not read as code: #
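As a minimal sketch of what a first script could contain (the commented-out division is only an illustration, not part of the original exercise):

```r
# This whole line is a comment, so R does not run it
5 + 5        # a comment can also follow code; this line prints [1] 10

# 100 / 4    # commented-out code is skipped until you delete the leading #
```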
File -> New File -> R Markdown
R-Markdown: more complex, both code and text, can also include images and some basic formatting (## for headers)
YAML information at top:
Run code:
- press Control + R (on Windows) or Cmd + Enter (on Mac) with your cursor in the line
- use green right-pointing arrow at right edge of code block
- arrow with line under it -> runs all code chunks above (but not the current one)
Output: directly below code block in preview window
Add a new (R) code block below this text. One of R’s most basic functions is as a (fancy) calculator. Try the following operations: 542 + 345, 67 * 29, 143/3
What happens if you chain together multiple operations? Try: 12 + 2 * 2 and (12 + 2) * 2 in an additional code block
Can you figure out what the following symbols mean in R: ^, %% ? Add another code block and give them a try. (Hint: %% is related to /)
Variables: give name/label to output, saves it (can be used anywhere below in current script)
name_of_variable <- code
For example:
## [1] 36
## [1] 19
## [1] 24
Variables are case sensitive (caps/lowercase matters) and can’t include spaces.
Commonly snake_case
You can preview your current variables in the R-Studio environment pane
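The assignment pattern and the naming rules above could look like this in practice. This is only a sketch: the variable names and numbers are made up for illustration and are not the ones behind the output printed earlier.

```r
# Hypothetical example variables, written in snake_case
apples <- 12
oranges <- 7

# Variables can be reused in any code further down the script
total_fruit <- apples + oranges
total_fruit                 # prints [1] 19

# Names are case sensitive: Total_Fruit is a different (here undefined) name
exists("total_fruit")       # TRUE
exists("Total_Fruit")       # FALSE
```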
You are planning a pizza party and want to figure out how many pieces of pizza each of your guests can eat before the pizzas run out. Create a variable called “guests” and assign it to the number of guests you have at your party: 15. (First, make a code block under this question!)
You order 5 pizzas. Assign the variable “pizzas” to the number of pizzas you have: 5. Each pizza has 12 slices. Assign the variable “slices” to the number of slices you have across all pizzas (hint: multiply the slices per pizza by the total number of pizzas – you can use the pizzas variable here instead of retyping the number)
Divide your “slices” variable by your “guests” variable to figure out how many slices each person can have.
5 more people arrive to the party uninvited. Update your “guests” variable to the total number you have now. How many slices can each person have now?
If you want to have multiple items of the same type:
- Use a vector – enclosed in the syntax c()
- length() returns number of items in the vector
## [1] 5
You can also do “vector-wise” operations:
## [1] 14 16 28 59 84
## [1] 144 196 676 3249 6724
## [1] 1.2 1.4 2.6 5.7 8.2
Mathematical operation is “broadcast” to each item
Add/subtract/multiply/divide a vector by another vector: items matched up one at a time in a loop
## [1] 24 28 52 114 164
You can also take a certain item from a vector using square brackets. For example, take the first item:
## [1] 12
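The vector operations above can be sketched as follows. The values are inferred from the printed output in this section; the name my_numbers is an assumption, since the original code is not shown.

```r
# Values inferred from the output above; the name my_numbers is hypothetical
my_numbers <- c(12, 14, 26, 57, 82)

length(my_numbers)         # 5 items

# A single number is applied to every item
my_numbers + 2             # 14 16 28 59 84
my_numbers^2               # 144 196 676 3249 6724
my_numbers / 10            # 1.2 1.4 2.6 5.7 8.2

# Two vectors of the same length are paired up item by item
my_numbers + my_numbers    # 24 28 52 114 164

# Square brackets pick out an item; R starts counting at 1
my_numbers[1]              # 12
```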
c() syntax!)
class() shows the data type of a value:
## [1] "numeric"
## [1] "numeric"
## [1] "numeric"
## [1] "numeric"
as.integer()
## [1] 8
## [1] 8
## [1] "integer"
## [1] "character"
Uncomment the following line by deleting the hashtag (#). Check out the error here – it may sound confusing but try to identify what it is telling you.
## [1] "character"
If you forget the quotes, you’ll also get an error (have to remove the comment character here to see it!)
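A rough sketch of the data types discussed above, including the two kinds of error mentioned. All values and names here are made-up examples, not the ones from the original code.

```r
# Hypothetical example values
class(8.7)                  # "numeric"

whole <- as.integer(8.7)    # as.integer() drops the decimal part
whole                       # 8
class(whole)                # "integer"

name <- "Ada"               # quotes turn this into text
class(name)                 # "character"
class("8")                  # also "character": quoted digits are text, not numbers

# Uncomment one line at a time to see the error messages:
# "8" + 1         # text cannot be used in arithmetic
# name <- Ada     # without quotes, R looks for a variable called Ada
```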
- test if something is identical with ==
- test if something is not identical with !=
- test if something is greater than with > or less than with <
- test if something is greater than or equal to with >= or less than or equal to with <=
## [1] TRUE
## [1] FALSE
## [1] TRUE
## [1] FALSE
## [1] "logical"
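The comparison operators might be tried out like this. The numbers are arbitrary examples chosen for illustration.

```r
# Arbitrary example numbers
3 == 3                 # TRUE
3 != 3                 # FALSE
7 > 3                  # TRUE
7 <= 3                 # FALSE

# Comparisons also work item by item on a vector
c(1, 5, 10) >= 5       # FALSE TRUE TRUE

# The result of a comparison can be stored like any other value
is_bigger <- 7 > 3
class(is_bigger)       # "logical"
```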
factor - string variables that should be treated as categories / distinct levels of a grouping -> will become more relevant when we work with datasets
fav_flavors <- c("chocolate", "vanilla", "strawberry", "strawberry", "chocolate", "mint chocolate chip", "chocolate", "vanilla", "strawberry", "chocolate")
class(fav_flavors)
## [1] "character"
## Length Class Mode
## 10 character character
Change to factor with as.factor()
## [1] "factor"
## chocolate mint chocolate chip strawberry vanilla
## 4 1 3 2
Can also see the possible groupings of a factor using levels()
## [1] "chocolate" "mint chocolate chip" "strawberry"
## [4] "vanilla"
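Put together, the factor steps above could be run as in the sketch below. The fav_flavors vector is the one from this section; the name fav_flavors_fct for the converted version is an assumption.

```r
fav_flavors <- c("chocolate", "vanilla", "strawberry", "strawberry", "chocolate",
                 "mint chocolate chip", "chocolate", "vanilla", "strawberry", "chocolate")

class(fav_flavors)          # "character"
summary(fav_flavors)        # only length, class and mode

# The name fav_flavors_fct is hypothetical
fav_flavors_fct <- as.factor(fav_flavors)

class(fav_flavors_fct)      # "factor"
summary(fav_flavors_fct)    # now a count per category
levels(fav_flavors_fct)     # the categories; as.factor() sorts them alphabetically
```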
Call nchar() on your name variable. What does it do? Try also toupper().
Bonus: What does TRUE == 1 return? What about FALSE == 1? Try them both with 0 too. What does this tell you about how R deals with logical variables?
Packages - extend basic R functions to add more complicated/specialized functionality
CRAN (the Comprehensive R Archive Network): the central host for R packages – any advanced R-programmer can make and submit their packages, any R-user can download and use them
Install packages once when you first use them
install.packages("packagename")
Try it now with one of the most used packages by copying the following code into your console:
install.packages("tidyverse")
library() command at beginning of script to use calls from that package – will only work if you’ve already installed the package
Good style: load all packages in first code block
Error about not finding a package? You probably don’t have it installed.
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
In summary: You need to install a package once, but need to load it (using the library command) every time you start a new R session.
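One possible way to make the install-once, load-every-session distinction visible in code is sketched below. The tidyverse_ready flag is a hypothetical helper for illustration; nothing is installed automatically.

```r
# requireNamespace() checks whether a package is installed without loading it
if (requireNamespace("tidyverse", quietly = TRUE)) {
  # Loading: needed in every new R session
  library(tidyverse)
  tidyverse_ready <- TRUE     # hypothetical flag, only for illustration
} else {
  # Installing: needed only once, best typed into the console
  message('tidyverse is not installed yet. Run install.packages("tidyverse") once.')
  tidyverse_ready <- FALSE
}
```

Checking first avoids the "there is no package called ..." error that library() gives for a package that was never installed.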
What error message do you get?
Now install the package “beepr” and call the package to this session with a library call somewhere above the beep command (line above is enough.) Try running the command again (make sure your speakers are on – but not too loud :D)
What type of file are you working with?
Where is the file saved?
working directory: the location that R should start looking for files – setwd()
setwd("~Users/kyla/Documents...") or setwd("C:/Documents...") – the path has to be in quotes
R-Markdown automatically sets the working directory to the folder the R-Markdown file is saved in
once you’ve set the working directory:
Read in the file:
Base R: read.csv() and read.delim() (for tab-separated files)
tidyverse: read_csv() and read_tsv() – these need a library() call first
read_csv2() for csv files in languages that use a comma as a decimal separator and thus a semicolon as a csv separator
Save the output to a variable.
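A self-contained sketch of reading a csv file and saving it to a variable. It first writes a small temporary file so that it runs anywhere; the rows are the first six shown in the head() output of this tutorial, while the temporary file location and the name jump_data are assumptions.

```r
# Create a small example file (location is a temporary file, an assumption)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "participant,height_cm,jump_height",
  "A,170,51",
  "B,165,46",
  "C,167,40",
  "D,159,45",
  "E,182,62",
  "F,174,50"
), csv_path)

# Base R reader; with the tidyverse loaded, read_csv(csv_path) is used the same way
jump_data <- read.csv(csv_path)

class(jump_data)     # "data.frame"
nrow(jump_data)      # 6 rows were read in
```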
## Parsed with column specification:
## cols(
## participant = col_character(),
## height_cm = col_double(),
## jump_height = col_double()
## )
## Parsed with column specification:
## cols(
## participant = col_character(),
## height_cm = col_double(),
## jump_height = col_double()
## )
Now you have a data file read in, but how do you see what’s in it?
head(): shows first six rows
## # A tibble: 6 x 3
## participant height_cm jump_height
## <chr> <dbl> <dbl>
## 1 A 170 51
## 2 B 165 46
## 3 C 167 40
## 4 D 159 45
## 5 E 182 62
## 6 F 174 50
Can change the number of rows with the n argument
## # A tibble: 3 x 3
## participant height_cm jump_height
## <chr> <dbl> <dbl>
## 1 A 170 51
## 2 B 165 46
## 3 C 167 40
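The two head() calls above can be tried on a small stand-in dataframe. The sketch below rebuilds the six printed rows by hand; the name jump_data is hypothetical, and a plain data.frame prints slightly differently from the tibble shown above.

```r
# Six rows copied from the printed output; the name jump_data is hypothetical
jump_data <- data.frame(
  participant = c("A", "B", "C", "D", "E", "F"),
  height_cm   = c(170, 165, 167, 159, 182, 174),
  jump_height = c(51, 46, 40, 45, 62, 50)
)

head(jump_data)           # first six rows (the default)
head(jump_data, n = 3)    # only the first three rows
head(jump_data, 3)        # same result: n is the second argument
```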
Or: click the name of the dataframe in the Environment tab. There you can also sort columns and filter rows (just for viewing purposes). This gets a bit slow with huge dataframes, but it is often a good first look.
summary(): call it on a dataframe to get each column and useful info based on the data type. For example, numeric columns will show the min, median, max and the quartiles (25% increments).
## participant height_cm jump_height
## Length:8 Min. :159.0 Min. :31.00
## Class :character 1st Qu.:164.2 1st Qu.:39.50
## Mode :character Median :166.0 Median :45.50
## Mean :168.0 Mean :45.38
## 3rd Qu.:171.0 3rd Qu.:50.25
## Max. :182.0 Max. :62.00
Call colnames() on the variable you saved the dataframe to. What does it do?
You have tried the head() command, now try the tail() command. What does it do?
Access a single column with the $ symbol (dataframe$column). R-Studio will helpfully suggest the column names as soon as it sees the $
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
## [1] 170 165 167 159 182 174 162 165
Basic descriptive statistics for numeric columns – mean, median and range
## [1] 168
## [1] 159 182
## [1] 166
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 159.0 164.2 166.0 168.0 171.0 182.0
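Column access with $ and the descriptive statistics above might look like this. The values are copied from the printed output; the name jump_data is an assumption, and only the two columns whose full contents are printed are rebuilt here.

```r
# Values copied from the output above; the name jump_data is hypothetical
jump_data <- data.frame(
  participant = c("A", "B", "C", "D", "E", "F", "G", "H"),
  height_cm   = c(170, 165, 167, 159, 182, 174, 162, 165)
)

# $ pulls one column out of the dataframe as a vector
jump_data$participant
jump_data$height_cm

mean(jump_data$height_cm)       # 168
range(jump_data$height_cm)      # 159 182 (smallest and largest value)
median(jump_data$height_cm)     # 166
summary(jump_data$height_cm)    # all of these plus the quartiles in one call
```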
The penguins data contains information on several different penguin species. But how many, exactly? Let’s try to use ‘summary’ to extract that information:
## Parsed with column specification:
## cols(
## species = col_character(),
## island = col_character(),
## bill_length_mm = col_double(),
## bill_depth_mm = col_double(),
## flipper_length_mm = col_double(),
## body_mass_g = col_double(),
## sex = col_character(),
## year = col_double()
## )
## Length Class Mode
## 344 character character
Call class on a variable to have R show us which data type it is:
## [1] "character"
## [1] "numeric"
as_factor()
## [1] "factor"
## Adelie Gentoo Chinstrap
## 152 124 68
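Note that the counts above list Gentoo before Chinstrap. as_factor() (from the forcats package, part of the tidyverse) keeps the categories in the order they first appear in the data, whereas base R's as.factor() sorts them alphabetically. The sketch below imitates both behaviours in base R on a made-up miniature of the species column, so it runs without any packages.

```r
# A made-up miniature of the species column (not the real data)
species <- c("Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie", "Gentoo")

class(species)                  # "character"

# Base R: levels are sorted alphabetically
species_abc <- as.factor(species)
levels(species_abc)             # "Adelie" "Chinstrap" "Gentoo"

# Levels in order of first appearance, like forcats::as_factor()
species_seen <- factor(species, levels = unique(species))
levels(species_seen)            # "Adelie" "Gentoo" "Chinstrap"
summary(species_seen)           # counts per species: 2, 3, 1
```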
The pipe %>% is an operator from the magrittr package (which is included in tidyverse, so no need to load it explicitly). When the pipe is used, the output of the code on its left is automatically passed on to the command that follows it. Specifically, the output is used as the first argument of the next command. Use Control + Shift + M (Cmd + Shift + M on Mac) to automatically insert %>%
## species island bill_length_mm bill_depth_mm
## Adelie :152 Length:344 Min. :32.10 Min. :13.10
## Gentoo :124 Class :character 1st Qu.:39.23 1st Qu.:15.60
## Chinstrap: 68 Mode :character Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 Length:344 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 Class :character 1st Qu.:2007
## Median :197.0 Median :4050 Mode :character Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
# is equivalent to:
penguins %>% # the line break here is optional, but makes the code more readable
  summary()
## species island bill_length_mm bill_depth_mm
## Adelie :152 Length:344 Min. :32.10 Min. :13.10
## Gentoo :124 Class :character 1st Qu.:39.23 1st Qu.:15.60
## Chinstrap: 68 Mode :character Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 Length:344 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 Class :character 1st Qu.:2007
## Median :197.0 Median :4050 Mode :character Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
## # A tibble: 4 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## # … with 1 more variable: year <dbl>
## # A tibble: 4 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## # … with 1 more variable: year <dbl>
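To see the pipe at work without the penguins file, here is a small sketch on a plain vector. The heights are the height_cm values printed earlier in this tutorial; loading magrittr directly (instead of the whole tidyverse) and the name heights are choices made for this example.

```r
library(magrittr)   # provides %>%; library(tidyverse) makes it available too

heights <- c(170, 165, 167, 159, 182, 174, 162, 165)

# These two lines do the same thing
summary(heights)
heights %>% summary()

# Extra arguments go after the piped value, which fills the first slot
head(heights, 3)
heights %>% head(3)

# Pipes can be chained: take heights, then sort them, then keep the first three
heights %>%
  sort() %>%
  head(3)             # 159 162 165
```

Chaining is where the pipe pays off: the steps read top to bottom in the order they happen, instead of inside-out as in head(sort(heights), 3).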
Look at the last three lines of the penguins data (hint: tail). Use the base R command, then try to rewrite it with the pipe %>%
Next time:
- tidyverse
- rename columns
- make new columns
- reorder
- filter based on condition
- etc.
As a follow-up to this tutorial, you might want to read chapters 3.3 - 3.9 in Navarro, Learning Statistics with R (p. 46 - 66).